Human activities have impacted the natural world to the extent that a term has been coined for the epoch that we live in today: the Anthropocene. It marks the start of an era of rapidly rising atmospheric CO2 levels and capricious climate. Maintaining our current rate of human activities is infeasible, as we are already experiencing unprecedented magnitudes of biodiversity loss and extinction.
With this report, I aim to utilise data collated in the Global Population Dynamics Database to find out the extent of the impact, as well as correlate it with atmospheric CO2 levels and emissions. I have always liked ecology, which is why I have chosen to embark on project, even though I have regrettably discontinued my studies in biology.
Nevertheless, I hope that my results can show how biodiversity is affected on various levels in different regions. This could possibly help to inform conservation efforts, or find out how effective currently implemented ones are.
a. How do atmospheric carbon dioxide levels correlate with biodiversity across the globe?
Overall, CO2 levels and biodiversity seem to have a negative correlation. On the phylum level of classification, all have negative correlations, other than Chordata which has a weak positive correlation, and Arthropoda, which has no correlation.
b. How do carbon dioxide emissions of continents correlate with biodiversity found there?
Overall, CO2 emissions and biodiversity also seem to have a negative correlation. On the phylum level of classification, all have strong negative correlations, other than Chordata which has a weak positive correlation, and Arthropoda, which has no correlation. It seems to correlate better with population data than global CO2 levels, as seen in Q1a.
a. Is the effect on biodiversity impacted by the biome the organisms live in?
Different biotopes and habitats have different correlations, but there is no obvious trend differentiating the biomes with negative correlations from those with positive ones.
b. How are organisms of different taxa affected differently over time?
Populations sorted by taxonomic class are affected negatively, other than Aves and Insecta.
c. How does the impact on biodiversity differ from region the region?
Various regions have different impacts on different phyla, though the overall correlation between population and time is still negative. Some trends can be explained by conservation efforts of countries.
Numbered list of dataset (with downloadable links) and a brief description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
The Global Population Dynamics Database (GPDD) was downloaded from this site. It is a repository of biodiversity sampling research collated from many different sources. It contains a large amount of information, including taxonomic, location, and temporal data.
Data about atmospheric CO2 data was downloaded from this site.
Data about CO2 emissions per region was downloaded from this site.
Data visualisation and analysis
### Q1: Climate change and biodiversity a. How do atmospheric carbon dioxide levels correlate with the biodiversity across the globe?
b. How do carbon dioxide emissions of continents correlate with biodiversity found there?
### Q2: Impacts on biodiversity a. Is the effect on biodiversity impacted by the biome the organisms live in?
b. How are organisms of different taxa affected differently over time?
c. How does the impact on biodiversity differ from region the region?
Datasets were downloaded from the links under Dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import math
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import scipy as sp
import plotly.express as px
import folium
import requests
import warnings
sns.set()
warnings.filterwarnings('ignore')
# GPDD
data=pd.read_csv('df35b.233.1-data.csv.csv')
main=pd.read_csv('df35b.234.1-main.csv.csv')
timeperiod=pd.read_csv('df35b.235.1-timeperiod.csv.csv')
taxon=pd.read_csv('df35b.236.1-taxon.csv.csv', encoding='latin')
datasource=pd.read_csv('df35b.237.1-datasource.csv.csv')
biotope=pd.read_csv('df35b.238.1-biotope.csv.csv')
location=pd.read_csv('df35b.239.1-location.csv.csv', encoding='latin')
# Annual atmospheric CO2 levels
co2=pd.read_csv('global-co-concentration-ppm.csv')
# Regional CO2 emissions
emissions=pd.read_csv('annual-co-emissions-by-region.csv')
Due to the complex nature of the GPDD, a lot of cleaning has to be performed.
The steps performed would roughly be:
After that, CO2 data corresponding to the years spanned by the time series of biodiversity data will be extracted, then correlated with the biodiversity data.
*This has to be done because some counts are in the tens while other are in the millions. To ensure that there is no over- or under-representation of data from a certain dataset, the data will be normalised. This is also due to the fact that the number of organisms counted (or any other proxy like density or biodiversity indices) is only from a small subsection of the Earth and not an absolute value, so normalising the data would provide a more accurate representation of global trends.
Firstly, we delete the irrelevant columns. The descriptions of the columns can be found in this user guide.
data.drop(['DataID',
'PopulationUntransformed',
'Generation',
'SeriesStep',
'DecimalYearBegin',
'DecimalYearEnd'
], axis=1, inplace=True)
data.head()
| MainID | Population | SampleYear | TimePeriodID | |
|---|---|---|---|---|
| 0 | 1 | 130.0 | 1959 | 408 |
| 1 | 1 | 0.0 | 1960 | 408 |
| 2 | 1 | 40.0 | 1961 | 408 |
| 3 | 1 | 90.0 | 1962 | 408 |
| 4 | 1 | 220.0 | 1963 | 408 |
To narrow the scope of the analysis, we will focus only on data with annual sampling. This is also done because the CO2 data is annual.
data=data.join(timeperiod[['TimePeriod', 'TimePeriodID']].set_index('TimePeriodID'), on='TimePeriodID')
data=data[data['TimePeriod'].isin(['Annual'])].reset_index().drop(['index', 'TimePeriodID'], axis=1)
data.head()
| MainID | Population | SampleYear | TimePeriod | |
|---|---|---|---|---|
| 0 | 1 | 130.0 | 1959 | Annual |
| 1 | 1 | 0.0 | 1960 | Annual |
| 2 | 1 | 40.0 | 1961 | Annual |
| 3 | 1 | 90.0 | 1962 | Annual |
| 4 | 1 | 220.0 | 1963 | Annual |
data.rename({'SampleYear': 'Year'}, axis=1, inplace=True)
After that, we will join main and data.
data=data.set_index('MainID').join(main.set_index('MainID')).reset_index().drop('MainID', axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | ... | EndYear | SamplingFrequency | DatasetLength | Notes | SiblyFittedTheta | SiblyThetaCILower | SiblyThetaCIUpper | SiblyExtremeNEffect | SiblyReturnRate | SiblyCarryingCapacity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | ... | 1988 | 1 | 30 | Source dominated landscape. | -0.5 | -31.6 | 1.6 | False | 0.4 | 66.1 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | ... | 1988 | 1 | 30 | Source dominated landscape. | -0.5 | -31.6 | 1.6 | False | 0.4 | 66.1 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | ... | 1988 | 1 | 30 | Source dominated landscape. | -0.5 | -31.6 | 1.6 | False | 0.4 | 66.1 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | ... | 1988 | 1 | 30 | Source dominated landscape. | -0.5 | -31.6 | 1.6 | False | 0.4 | 66.1 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | ... | 1988 | 1 | 30 | Source dominated landscape. | -0.5 | -31.6 | 1.6 | False | 0.4 | 66.1 |
5 rows × 27 columns
# Delete irrelevant columns
data.drop(['AssociatedDataSets',
'StartYear',
'EndYear',
'SamplingFrequency',
'DatasetLength',
'Notes',
'SiblyFittedTheta',
'SiblyThetaCILower',
'SiblyThetaCIUpper',
'SiblyExtremeNEffect',
'SiblyReturnRate',
'SiblyCarryingCapacity'
], axis=1, inplace=True)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
Since we are dealing with time series, data with a SampleYear is -9999 (i.e. unknown) will be discarded.
data=data[data['Year'] != -9999].reset_index().drop(['index'], axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
The column Reliability is based on a subjective relative ranking, where 1 = least reliable, and 5 = most reliable.
data['Reliability'].value_counts()
1 26239 -1 20340 2 13007 4 12757 3 5321 5 607 Name: Reliability, dtype: int64
Unfortunately, the largest proportion of data has a reliability rating of 1. Nonetheless, we will prioritise quality over quantity, and look at data with a reliability rating of 3 or more.
data=data[data['Reliability'] >= 3].reset_index().drop(['index'], axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
Looking at the columns again, we realise that we cannot directly compare the populations of various samples. This is because due to different variables such as SamplingUnits, SamplingProtocol, SourceDimension, SamplingEffort, SpatialDensity, SourceTransform, and SourceTransformReference, the data must be transformed.
For the sake of simplicity, we will only take into consideration samples of the total population of the species. Nonetheless, relevant transformations must still be performed.
data['SourceTransformReference'].unique()
array(['None', 'Unknown', 'Log (10)', 'Index of abundance',
'Log (10) x +1', '1966 = 100', 'N at (t+1)/N at t', 'x 10^5',
'x 10^4', 'Standardised Density', 'x 10^3'], dtype=object)
data=data[~(data['SourceTransformReference'].isin(['N at (t+1)/N at t', 'Percentage change from max.']))]
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
data.groupby(['SourceTransformReference'])['Population'].describe().sort_values(['count'], ascending=False)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SourceTransformReference | ||||||||
| None | 8923.0 | 4537.387124 | 93167.900444 | -0.7000 | 2.0000 | 10.0000 | 54.000000 | 3.600000e+06 |
| Unknown | 8293.0 | 73.448981 | 196.795477 | 0.0000 | 4.0000 | 16.0000 | 57.000000 | 3.957000e+03 |
| Standardised Density | 660.0 | -0.026364 | 0.977207 | -2.0000 | -0.8000 | -0.1000 | 0.600000 | 3.300000e+00 |
| Log (10) | 312.0 | 1.536250 | 0.962985 | -0.3000 | 1.0000 | 1.2750 | 2.062500 | 4.000000e+00 |
| Log (10) x +1 | 217.0 | 6.387097 | 10.009598 | 0.0000 | 1.0000 | 2.0000 | 6.000000 | 5.000000e+01 |
| Index of abundance | 90.0 | 99.722222 | 88.034215 | 0.0000 | 20.0000 | 90.5000 | 156.000000 | 3.520000e+02 |
| x 10^3 | 68.0 | 68.235294 | 71.854147 | 0.0000 | 8.7500 | 35.0000 | 113.250000 | 2.530000e+02 |
| x 10^4 | 40.0 | 6.570638 | 4.864414 | 0.8491 | 2.3137 | 6.0920 | 9.435125 | 1.719340e+01 |
| x 10^5 | 40.0 | 6.762707 | 0.908658 | 5.6144 | 5.9534 | 6.7055 | 7.108075 | 8.877100e+00 |
| 1966 = 100 | 32.0 | 104.093750 | 19.087843 | 60.0000 | 95.7500 | 100.0000 | 120.250000 | 1.380000e+02 |
Note that the SourceTransformReferences Log (10) and Log (10) x +1 can be 'untransformed' back into their actual Population counts.
def untransform(row):
if row['SourceTransformReference'] == 'Log (10)':
row['Population']=math.pow(10, row['Population'])
elif row['SourceTransformReference'] == 'Log (10) x +1)':
row['Population']=math.pow(10, row['Population'])+1
return row
data=data.apply(untransform, axis='columns').reset_index().drop(['index'], axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | BiotopeID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 |
Now, we can proceed with merging the other datasets along the columns DataSourceID, BiotopeID, LocationID, and TaxonID.
data=data.join(biotope.set_index('BiotopeID'), on='BiotopeID').drop(['BiotopeID'], axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | LocationID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | HabitatName | BiotopeType | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest |
For location, to simplify our analysis, we will only keep the columns Country, Continent, Ocean, LongDD, and LatDD.
location.columns
Index(['LocationID', 'ExactName', 'TownName', 'CountyStateProvince', 'Country',
'Continent', 'Ocean', 'LongitudeDegrees', 'LongitudeMinutes', 'EorW',
'LatitudeDegrees', 'LatitudeMinutes', 'NorS', 'LongDD', 'LatDD',
'North', 'East', 'South', 'West', 'Altitude', 'Area', 'Notes',
'SpatialAccuracy', 'LocationExtent'],
dtype='object')
data=data.join(location[['LocationID',
'Country',
'Continent',
'Ocean',
'LongDD',
'LatDD'
]].set_index('LocationID'), on='LocationID').drop(['LocationID'], axis=1)
data.head()
| Population | Year | TimePeriod | TaxonID | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | HabitatName | BiotopeType | Country | Continent | Ocean | LongDD | LatDD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.0 | 62.0 |
| 1 | 0.0 | 1960 | Annual | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.0 | 62.0 |
| 2 | 40.0 | 1961 | Annual | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.0 | 62.0 |
| 3 | 90.0 | 1962 | Annual | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.0 | 62.0 |
| 4 | 220.0 | 1963 | Annual | 1 | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.0 | 62.0 |
data['Ocean'].unique()
array(['Not applicable', 'North Pacific Ocean', 'South Pacific',
'Atlantic', 'North Sea', 'Pacific', 'Pacific Ocean'], dtype=object)
data['Ocean']=data['Ocean'].replace({'Pacific Ocean': 'Pacific', 'North Pacific Ocean': 'North Pacific'})
taxon.columns
Index(['TaxonID', 'TaxonName', 'WoldaCode', 'Authority', 'TaxonomicLevel',
'CommonName', 'TaxonomicPhylum', 'TaxonomicClass', 'TaxonomicOrder',
'TaxonomicFamily', 'TaxonomicGenus', 'Notes'],
dtype='object')
For taxon, once again, to simplify our analysis, we will just keep the columns CommonName, TaxonomicPhylum, TaxonomicClass, TaxonomicOrder, TaxonomicFamily, and TaxonomicGenus.
Phylum > Class > Order > Family > Genus > Species (Common name)
According to the Linnaean classifications system, the broadness of classification is as such, with phylum being the broadest umbrella term, and species being the most specific group of organisms.
data=data.join(taxon[['TaxonID',
'CommonName',
'TaxonomicPhylum',
'TaxonomicClass',
'TaxonomicOrder',
'TaxonomicFamily',
'TaxonomicGenus'
]].set_index('TaxonID'), on='TaxonID').drop(['TaxonID'], axis=1)
data.head()
| Population | Year | TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | ... | Continent | Ocean | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130.0 | 1959 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | ... | Europe | Not applicable | 57.0 | 62.0 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa |
| 1 | 0.0 | 1960 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | ... | Europe | Not applicable | 57.0 | 62.0 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa |
| 2 | 40.0 | 1961 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | ... | Europe | Not applicable | 57.0 | 62.0 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa |
| 3 | 90.0 | 1962 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | ... | Europe | Not applicable | 57.0 | 62.0 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa |
| 4 | 220.0 | 1963 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | ... | Europe | Not applicable | 57.0 | 62.0 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa |
5 rows × 25 columns
data=data.pivot_table(index=[e for e in data.columns if e not in ('Year', 'Population')], columns='Year', values='Population')
data.head()
| Year | 1820 | 1821 | 1822 | 1823 | 1824 | 1825 | 1826 | 1827 | 1828 | 1829 | ... | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | HabitatName | BiotopeType | Country | Continent | Ocean | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | |||||||||||||||||||||
| Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 115.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | All Individuals | Count | Count | Census | None | None | None | 4 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 52.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 16.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||||||||||||||||
| Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 19.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||||||||||||||||
| Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 92.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 178 columns
((data.shape[0]-data.isna().sum(axis=0))/data.shape[0]*100).plot(
title='Percentage of non-null values (aggregated from all time series) per year'
);
We see that a large portion of the years actually have close to all null values. We will focus our analysis on data with at least 10% non-null data, as that also corresponds with more recent years, which would be more relevant to the current global climate crisis we are facing. Note that the years with >= 10% non-null data forms a continuous stretch, with no gaps in between.
data.drop(data.columns[((data.shape[0]-data.isna().sum(axis=0))/data.shape[0] < 0.1)], axis=1, inplace=True)
data=pd.melt(data.reset_index(),
id_vars=[e for e in data.reset_index().columns if e not in data.columns],
var_name='Year',
value_name='Population'
)
data.head()
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | ... | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | Year | Population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | ... | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | 1950 | NaN |
| 1 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | 1950 | NaN |
| 2 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | 1950 | NaN |
| 3 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | 1950 | NaN |
| 4 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | 1950 | NaN |
5 rows × 25 columns
Finally, we are done with the data cleaning for the GPDD, and we will move on to the CO2 data.
co2
| Entity | Code | Year | CO2 concentrations (NOAA, 2018) | |
|---|---|---|---|---|
| 0 | World | OWID_WRL | 1 | 276.70 |
| 1 | World | OWID_WRL | 30 | 277.90 |
| 2 | World | OWID_WRL | 56 | 277.40 |
| 3 | World | OWID_WRL | 104 | 277.50 |
| 4 | World | OWID_WRL | 136 | 278.10 |
| ... | ... | ... | ... | ... |
| 218 | World | OWID_WRL | 2014 | 398.65 |
| 219 | World | OWID_WRL | 2015 | 400.83 |
| 220 | World | OWID_WRL | 2016 | 404.24 |
| 221 | World | OWID_WRL | 2017 | 406.55 |
| 222 | World | OWID_WRL | 2018 | 408.52 |
223 rows × 4 columns
co2=co2.drop(['Entity', 'Code'], axis=1).set_index('Year').rename({'CO2 concentrations (NOAA, 2018)': 'CO2'}, axis=1)
co2
| CO2 | |
|---|---|
| Year | |
| 1 | 276.70 |
| 30 | 277.90 |
| 56 | 277.40 |
| 104 | 277.50 |
| 136 | 278.10 |
| ... | ... |
| 2014 | 398.65 |
| 2015 | 400.83 |
| 2016 | 404.24 |
| 2017 | 406.55 |
| 2018 | 408.52 |
223 rows × 1 columns
We filter out the CO2 levels data corresponding to the years spanned by the population data.
co2=co2[(co2.index >= data['Year'][0]) & (co2.index <= data['Year'].iloc[-1])]
co2
| CO2 | |
|---|---|
| Year | |
| 1950 | 312.83 |
| 1952 | 312.18 |
| 1954 | 312.73 |
| 1955 | 314.71 |
| 1956 | 315.34 |
| 1957 | 315.34 |
| 1958 | 315.34 |
| 1959 | 315.97 |
| 1960 | 316.91 |
| 1961 | 317.64 |
| 1962 | 318.45 |
| 1963 | 318.99 |
| 1964 | 319.62 |
| 1965 | 320.04 |
| 1966 | 321.38 |
| 1967 | 322.16 |
| 1968 | 323.04 |
| 1969 | 324.62 |
| 1970 | 325.68 |
| 1971 | 326.32 |
| 1972 | 327.45 |
| 1973 | 329.68 |
| 1974 | 330.18 |
| 1975 | 331.11 |
| 1976 | 332.04 |
| 1977 | 333.83 |
| 1978 | 335.40 |
| 1979 | 336.84 |
| 1980 | 338.75 |
| 1981 | 340.11 |
| 1982 | 341.45 |
| 1983 | 343.05 |
| 1984 | 344.65 |
| 1985 | 346.12 |
| 1986 | 347.42 |
| 1987 | 349.19 |
| 1988 | 351.57 |
| 1989 | 353.12 |
Similarly, we filter out the CO2 emissions data corresponding to those years too.
emissions=emissions[emissions['Entity'].isin(['Africa',
'Asia',
'North America',
'South America',
'Europe',
'Antarctica',
'Australia'
])]
emissions=emissions.drop(['Code'], axis=1).rename({'Entity': 'Continent', 'Annual CO2 emissions (zero filled)': 'Emissions'},
axis=1
)
emissions=pd.pivot_table(emissions, index='Continent', columns='Year')
emissions.columns=emissions.columns.get_level_values(1)
emissions=emissions.T[(emissions.T.index >= data['Year'][0]) & (emissions.T.index <= data['Year'].iloc[-1])]
emissions.head()
| Continent | Africa | Antarctica | Asia | Australia | Europe | North America | South America |
|---|---|---|---|---|---|---|---|
| Year | |||||||
| 1950 | 93426522 | 0 | 483824105 | 54738653 | 2375472717 | 2739263884 | 112569753 |
| 1951 | 99832774 | 0 | 464252079 | 59029376 | 2621723173 | 2834638925 | 131276173 |
| 1952 | 108849402 | 0 | 515094063 | 60201351 | 2704892257 | 2762310744 | 141463469 |
| 1953 | 110774135 | 0 | 551281785 | 59434623 | 2781872541 | 2824982416 | 140631958 |
| 1954 | 116120274 | 0 | 604140401 | 67838562 | 2959523259 | 2712018147 | 151041453 |
data=data.pivot_table(index=[e for e in data.columns if e not in ('Year', 'Population')], columns='Year', values='Population')
data.head()
| Year | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | HabitatName | BiotopeType | Country | Continent | Ocean | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | |||||||||||||||||||||
| Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 72.5 | ... | 45.0 | 125.0 | 177.5 | 75.0 | 25.0 | 40.0 | 25.0 | 42.5 | 115.0 | NaN |
| 3 | All Individuals | Count | Count | Census | None | None | None | 4 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 67.0 | 197.0 | 183.0 | 149.0 | 102.0 | 141.0 | 147.0 | 178.0 | 52.0 | NaN | |
| Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 22.0 | 27.0 | 43.0 | 45.0 | 32.0 | 22.0 | 15.0 | 12.0 | 16.0 | NaN | |||||||||||||||||
| Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 4.0 | 6.0 | 6.0 | 4.0 | 11.0 | 7.0 | 12.0 | 19.0 | NaN | |||||||||||||||||
| Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 9.0 | 155.0 | 188.0 | 62.0 | 89.0 | 66.0 | 93.0 | 69.0 | 92.0 | NaN |
5 rows × 40 columns
Before we proceed with the normalisation, it would be prudent to remove time series with too many zero values, since with a significant proportion of values are zero, the sample may tend to be less accurate due to the small size.
((data == 0).sum(axis=1)/(data.shape[1]-data.isnull().sum(axis=1))).sort_values().plot(
title=f'Ratio of zero to non-null values per time series from {data.columns[0]} to {data.columns[-1]} ({data.columns.size+1} years)'
)
x_axis=plt.gca().get_xaxis()
x_axis.set_visible(False)
data=data[((data == 0).sum(axis=1)/(data.shape[1]-data.isna().sum(axis=1))) <= 0.1]
# Min-Max normalisation
data_values=data.values
data_minmax=preprocessing.minmax_scale(data_values.T).T
data_values=pd.DataFrame(data_minmax)
data_values.index=data.index
data_values.columns=data.columns
data=data_values
data.head()
| Year | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | HabitatName | BiotopeType | Country | Continent | Ocean | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | |||||||||||||||||||||
| Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.305882 | ... | 0.176471 | 0.552941 | 0.800000 | 0.317647 | 0.082353 | 0.152941 | 0.082353 | 0.164706 | 0.505882 | NaN |
| 3 | All Individuals | Count | Count | Census | None | None | None | 4 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.325123 | 0.965517 | 0.896552 | 0.729064 | 0.497537 | 0.689655 | 0.719212 | 0.871921 | 0.251232 | NaN | |
| Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.394737 | 0.526316 | 0.947368 | 1.000000 | 0.657895 | 0.394737 | 0.210526 | 0.131579 | 0.236842 | NaN | |||||||||||||||||
| Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.117647 | 0.117647 | 0.235294 | 0.235294 | 0.117647 | 0.529412 | 0.294118 | 0.588235 | 1.000000 | NaN | |||||||||||||||||
| Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.042781 | 0.823529 | 1.000000 | 0.326203 | 0.470588 | 0.347594 | 0.491979 | 0.363636 | 0.486631 | NaN |
5 rows × 40 columns
(data.shape[1]-data.isna().sum(axis=1)).sort_values().plot(
title=f'Number of non-null values per time series from {data.columns[0]} to {data.columns[-1]} ({data.columns.size+1} years)'
)
x_axis=plt.gca().get_xaxis()
x_axis.set_visible(False)
See that most time series span across 10 years or more. 10 points would be enough to see the general trend. Hence, we will shave off time series with less than 10 points.
data=data[(data.shape[1]-data.isna().sum(axis=1)) >= 10]
data=pd.melt(data.reset_index(),
id_vars=[e for e in data.reset_index().columns if e not in data.columns],
var_name='Year',
value_name='Population'
)
data.head()
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | ... | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | Year | Population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | ... | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | 1950 | NaN |
| 1 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | 1950 | NaN |
| 2 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | 1950 | NaN |
| 3 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | 1950 | NaN |
| 4 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | -0.183333 | 52.433333 | Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | 1950 | NaN |
5 rows × 25 columns
data=pd.merge(data, co2, left_on='Year', right_index=True, how='outer').reset_index().drop(['index'], axis=1)
data['Emissions']=0
def add_emissions(row):
if row['Continent'] == 'Unknown' or row['Continent'] == 'Not applicable':
return row
else:
row['Emissions']=emissions.loc[row['Year']][row['Continent']]
return row
data=data.apply(add_emissions, axis='columns')
data.head()
| TimePeriod | DataSourceID | SamplingUnits | SamplingProtocol | SourceDimension | SamplingEffort | SpatialDensity | SourceTransform | SourceTransformReference | Reliability | ... | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | Year | Population | CO2 | Emissions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Annual | 1 | All Individuals | Count | Density | Sample | per 1000ha^2 | None | None | 4 | ... | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | 1950 | NaN | 312.83 | 2375472717 |
| 1 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | 1950 | NaN | 312.83 | 2375472717 |
| 2 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | 1950 | NaN | 312.83 | 2375472717 |
| 3 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | 1950 | NaN | 312.83 | 2375472717 |
| 4 | Annual | 3 | All Individuals | Count | Count | Census | None | None | None | 4 | ... | Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | 1950 | NaN | 312.83 | 2375472717 |
5 rows × 27 columns
data.drop([e for e in data.columns if e not in ('HabitatName',
'BiotopeType',
'Country',
'Continent',
'Ocean',
'LongDD',
'LatDD',
'CommonName',
'TaxonomicPhylum',
'TaxonomicClass',
'TaxonomicOrder',
'TaxonomicFamily',
'TaxonomicGenus',
'Year',
'Population',
'CO2',
'Emissions'
)], axis=1, inplace=True)
data.head()
| HabitatName | BiotopeType | Country | Continent | Ocean | LongDD | LatDD | CommonName | TaxonomicPhylum | TaxonomicClass | TaxonomicOrder | TaxonomicFamily | TaxonomicGenus | Year | Population | CO2 | Emissions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Boreal taiga forest | Coniferous forest | Russian Federation | Europe | Not applicable | 57.000000 | 62.000000 | Hazel grouse | Chordata | Aves | Galliformes | Tetraonidae | Bonasa | 1950 | NaN | 312.83 | 2375472717 |
| 1 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Azure damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Coenagrion | 1950 | NaN | 312.83 | 2375472717 |
| 2 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Blue-tailed damselfly | Arthropoda | Insecta | Odonata | Coenagriidae | Ischnura | 1950 | NaN | 312.83 | 2375472717 |
| 3 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Common darter | Arthropoda | Insecta | Odonata | Libellulidae | Sympetrum | 1950 | NaN | 312.83 | 2375472717 |
| 4 | Lowland fen | Wetland | UK - England | Europe | Not applicable | -0.183333 | 52.433333 | Emerald damselfly | Arthropoda | Insecta | Odonata | Lestidae | Lestes | 1950 | NaN | 312.83 | 2375472717 |
First, let us define some terminology.
Correlations
Strong: |r| > 0.5
Moderate: 0.5 > |r| > 0.3
Weak: 0.3 > |r| > 0.2
Negligible/None: 0.2 > |r|
Indeterminate: Data points are weird, hence no clear-cut conclusion can be drawn.
Next, we define some functions to modularise the code.
def get_annotate(factor):
def annotate(data, **kws):
r, p=sp.stats.pearsonr(data[factor], data['Population'])
ax=plt.gca()
ax.text(.05, .93, 'r={:.3f}, p={:.3g}'.format(r, p), transform=ax.transAxes)
return annotate
# Plots the relationship between populations of species grouped by 'var' (e.g. Continent) and 'factor' (e.g. CO2)
# threshold is the minimum magnitude of r the plot must have (to filter out more significant correlations)
# multiplier takes a value of -1 to filter out negative correlations, +1 for positive, 0 for both
def lm_plot(var, factor, data=data, threshold=0, multiplier=0, sharex=True):
mod=format_mod(data=data, var=var, factor=factor, threshold=threshold, multiplier=multiplier)
g=sns.lmplot(data=mod,
x=factor,
y='Population',
sharex=sharex,
col=var,
hue=var,
col_wrap=min(len(mod[var].unique()), 4)
)
fig=g.fig
fig.suptitle('Graph of Population against {} over {}'.format(factor, var), fontsize=24, y=1.05)
g.map_dataframe(get_annotate(factor));
def format_mod(data, var, factor, threshold, multiplier):
# Measurement of central tendency needed to combine large number of data points to make visualisation clearer
# Median is used to estimate the central tendency as it is resistant and the min-max normalisation is prone to outliers
if factor == 'Year':
mod=data.groupby(['Year', var]).median().dropna().reset_index()
else:
mod=data.groupby(['Year', var]).median().reset_index().set_index('Year').dropna().reset_index()
for v in mod[var].unique():
if (np.isnan((sp.stats.pearsonr(mod[mod[var] == v][factor], mod[mod[var] == v]['Population'])[0]))
or (abs(sp.stats.pearsonr(mod[mod[var] == v][factor], mod[mod[var] == v]['Population'])[0]) < threshold)
or (sp.stats.pearsonr(mod[mod[var] == v][factor], mod[mod[var] == v]['Population'])[0]*multiplier < 0)):
mod=mod[mod[var] != v]
return mod
def heatmap(index, columns='Year', data=data, category=''):
heatmap_data=pd.pivot_table(data, values='Population', index=[index], columns=columns)
plt.figure(figsize=(20, 8));
sns.heatmap(heatmap_data, cmap='YlGnBu')
if category == '':
plt.title('Population against {} over {}'.format(columns, index), fontsize=24, y=1.05);
else:
plt.title('Population against {} over {} for {}'.format(columns, index, category), fontsize=24, y=1.05);
def multiscatter_plot(hue, col, lm=False, data=data):
if lm:
g=sns.lmplot(data=data.groupby([hue, 'Year', col]).median().dropna().reset_index(),
x='Year',
y='Population',
col=col,
hue=hue,
col_wrap=3
)
else:
g=sns.relplot(data=data.groupby([hue, 'Year', col]).median().dropna().reset_index(),
x='Year',
y='Population',
col=col,
hue=hue,
col_wrap=3
)
fig=g.fig
fig.suptitle('Graph of Population against Year over {}'.format(col), fontsize=24, y=1.05);
def get_r(var, factor, data=data):
mod=format_mod(data=data, var=var, factor=factor, threshold=0, multiplier=0)
r_values=[]
for v in data[var].unique():
r,p=sp.stats.pearsonr(data[data[var] == v].dropna()[factor], data[data[var] == v].dropna()['Population'])
r_values.append(r)
df=pd.DataFrame(data=[np.around(r_values, 3), np.around(np.absolute(np.array(r_values)), 3)],
index=['r', 'abs_r'], columns=data[var].unique()
).T
df['sign']=0
def get_sign(row):
if row['r'] < 0:
row['sign']='-'
else:
row['sign']='+'
return row
return df.apply(get_sign, axis='columns')
def choropleth(title, var='', val='', color='Population'):
if var != '' and val != '':
px.choropleth(data[data[var] == val],
locations='IsoAlpha',
title=title,
animation_frame='Year',
color=color,
projection='equirectangular'
).show()
else:
px.choropleth(data,
locations='IsoAlpha',
title=title,
animation_frame='Year',
color=color,
projection='equirectangular'
).show()
def sunburst(by, title='', data=data):
mod=data.groupby(by).count()
mod['Value']=1
mod=mod[['Value']]
mod.reset_index(inplace=True)
# Note that this would make the smallest sectors of the sunburst be of equal size, so as to see all of them clearly
px.sunburst(mod, path=by, title=title + ' (Equal size)', width=900, height=600).show()
# This, on the other hand, makes the sizes of the sectors proportional to the number of non-null values of population
px.sunburst(data[data['Population'].notna()], path=by, title=title + ' (Proportional to value count)',
width=900, height=600).show()
To make choropleth maps, we need to add the ISO country codes.
data.Country.unique()
array(['Russian Federation', 'UK - England', 'Norway', 'UK - Wales',
'USA', 'United Kingdom', 'UK - Scotland', 'Netherlands', 'Canada',
'Gibraltar', 'Multiple or International jurisdiction', 'Tanzania',
'UK - Dependency', 'Denmark', 'Antarctica', 'South Africa',
'Ecuador', 'Australia', 'Japan', 'Sweden', 'Finland', 'Hungary',
'France', 'Germany', 'Estonia'], dtype=object)
iso_codes={
'Russian Federation':'RUS',
'UK - England': 'GBR',
'Norway': 'NOR',
'UK - Wales': 'GBR',
'USA': 'USA',
'United Kingdom': 'GBR',
'UK - Scotland': 'GBR',
'Netherlands': 'NLD',
'Canada': 'CAN',
'Gibraltar': 'GIB',
'Multiple of International jurisdiction': '',
'Tanzania': 'TZA',
'UK - Dependency': 'GBR',
'Denmark': 'DNK',
'Antarctica': 'ATA',
'South Africa': 'ZAF',
'Ecuador': 'ECU',
'Australia': 'AUS',
'Japan': 'JPN',
'Sweden': 'SWE',
'Finland': 'FIN',
'Hungary': 'HUN',
'France': 'FRA',
'Germany': 'DEU',
'Estonia': 'EST'
}
data['IsoAlpha']=data['Country'].map(iso_codes)
This is done before the actual in-depth analysis of the data to get a better idea of what we are working with.
sunburst(by=['TaxonomicPhylum', 'TaxonomicClass', 'TaxonomicOrder', 'TaxonomicFamily'],
title='A circular taxonomic dendrogram of the organisms sampled'
)
From these two sunburst diagrams, we see that Chordata and Arthropoda are the most well-documented phyla, and within those categories, Aves and Insecta are respectively the classes with the most amount of data.
This is perhaps due to the nature of the phylum and the sample. Arthropods form the most diverse and populous phyla. They include insects, crustaceans, and spiders, amongst many others. Whereas chordates comprise all vertebrates (animals which have a backbone, which is already very diverse), sea squirts and lancelets.
Within the phyla, Aves (birds) and Insecta (insects) could have been sampled the most because organisms can be relatively easily counted, by doing a visual count for the birds, and insect sweeping for insects.
sunburst(by=['Continent', 'Country'], title='A sunburst of the countries from which samples were taken')
Interestingly, about 75% of data collected is from Europe, with a disproportionate amount coming from England alone. The bulk of the remainder is contributed by North America. Hence, data collated from Europe and North America will be studied more in-depth later on.
fig=px.choropleth(pd.merge(pd.DataFrame(data['IsoAlpha'].unique()),
data[['IsoAlpha', 'Country']],
left_on=0,
right_on='IsoAlpha',
how='left'
).drop_duplicates(),
locations='IsoAlpha',
hover_name='Country',
title='A map of the countries from which samples were taken',
projection='equirectangular'
)
fig.update_layout(showlegend=False)
fig.show()
mod=data.groupby(['Year', 'IsoAlpha']).count()[['Population']].reset_index().rename({'Population': 'Sample Count'}, axis=1)
px.choropleth(mod[mod['Sample Count'] > 0],
locations='IsoAlpha',
color='Sample Count',
animation_frame='Year',
title='Yet another map of the countries from which samples were taken',
projection='equirectangular'
).show()
As corroborated by the previous visualisations, most of South America, Africa, and Asia have no data. Thus, the further analysis below may only be applicable to other parts of the world with more data, notably Europe and North America.
sunburst(by=['Continent', 'Country', 'TaxonomicPhylum', 'TaxonomicClass'], title='Classes of organisms studied per country')
Here, we see that most of the Insecta data is contributed by England. However, the Aves data is contributed by a lot of countries all over the world. In fact, the majority of the countries surveyed have contributed some Aves data. Thus, under data analysis, Aves population across different countries will be investigated.
sunburst(by=['Continent', 'TaxonomicPhylum'], title='Phyla of organisms studied per continent')
From the above charts, we see that Chordata data was taken from all continents, but it is the only phylum for Africa, Antarctica, Asia, and Australia. Meanwhile, South America has two phyla (Chordata, Angiospermophyta), North America has four (Mollusca, Chordata, Arthropoda, Angiospermophyta), while Europe has six (Monera, Mollusca, Chromophyta (Heterokontophyta), Chordata, Arthropoda, Angiospermophyta). This is in line with the fact that most samples were done in Europe and North America, hence it would make sense that both regions have more diversity in the samples taken.
sunburst(data=data[data['Ocean'] != 'Not applicable'],
by=['Ocean', 'TaxonomicPhylum', 'TaxonomicClass', 'TaxonomicOrder', 'TaxonomicFamily', 'TaxonomicGenus', 'CommonName'],
title='Organisms studied per ocean'
)
Mosot data is collated from onshore sites (possibly due to the diffculty of collecting underwater population data). Of the data collected in the ocean, most is from the Pacific. However, even with that being said, there are still less than 150 data points from all the oceans combined. The populations of only six individual species is definitely insufficient to draw a representative conclusion of biodiversity in the ocean. Conclusions drawn from ocean data should be taken with an adequate pinch of salt.
sns.lineplot(data=data, x='Year', y='CO2')
plt.title(label='Atmospheric CO2 levels increase (approximately) linearly over time', size=18, y=1.05);
lm_plot(var='Continent', factor='CO2')
Negative correlation
Positive correlation
No correlation/Indeterminate
For the continents with negative correlations, note that the trends are both strong, whereas the strength of the positive correlations are weak. This might mean that on the whole, biodiversity across the globe is decreasing with increasing atmospheric CO2 levels.
For continents with more data points (Europe, North America), the correlation is negligible. This could be because many time series of different species in various environments with conflicting parameters were aggregated and plotted on a single graph, which 'cancels out', resulting in no significant observable correlation.
lm_plot(var='TaxonomicPhylum', factor='CO2')
Positive correlation
No correlation/Indeterminate
The negative correlations are stronger than the positive ones. This could mean that overall, biodiversity is negatively affected by increasing CO2 levels. This is bad because CO2 levels are still on the rise, which would continue to negatively affect biodiversity even more.
# Countries highlighted are those with at least one time series
choropleth(color='Emissions', title='CO2 emissions of continents are increasing over time')
mod=data.groupby(['Year', 'Continent']).mean().reset_index()
px.pie(mod[mod['Continent'] != 'Not applicable'],
values='Emissions',
names='Continent',
title='Average proportion of CO2 Emissions over the years'
).show()
px.area(data[data['Continent'] != 'Not applicable'], x='Year', y='Emissions', color='Continent',
title='CO2 Emissions quadrupled in 40 years').show()
px.bar(data[data['Continent'] != 'Not applicable'], x='Continent', y='Emissions', color='Continent',
animation_frame='Year', animation_group='Continent', range_y=[0,8000000000],
title='Europe, North America, Asia are the major contributors to CO2 emissions').show()
lm_plot(var='TaxonomicPhylum', factor='Emissions', sharex=False)
Negative correlation
Positive correlation
No correlation/Indeterminate
The negative correlations are stronger than the positive one. This could mean that overall, biodiversity is negatively affected by CO2 emissions in the region they inhabit.
Also, refer to these results, where a similar graph was plotted, but against global atmospheric CO2 levels. The correlations against emissions for each region are stronger (see that all the negative correlations here are considered to be strong), which could mean that emissions per region could be a better indicator of declining populations. In other words, it should be recommended that various regions keep their CO2 emissions in check to avoid catastrophic decline in biodiversity.
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='TaxonomicClass', category='Arthropoda')
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='TaxonomicOrder', category='Arthropoda')
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='TaxonomicFamily', category='Arthropoda')
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='TaxonomicGenus', category='Arthropoda')
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='CommonName', category='Arthropoda')
Phylum > Class > Order > Family > Genus > Species (Common name)
There are many different hierarchical levels, according to the widely-used Linaaean system of classification seen above. However, looking at Genus and Species, the classification seems to be too specific to make any reasonable general observation. Hence, we will focus on Family, since it is still relatively broad, yet some trends can be elucidated.
heatmap(data=data[data['TaxonomicPhylum'] == 'Arthropoda'], index='TaxonomicFamily', category='Arthropoda')
From this, we see that there are some negative correlations (e.g. Arctiidae, Noctuidae, Tortricidae), but also some positive correlations (e.g. Coenagriidae, Nymphalidae), while the rest still do not have any clear correlation.
However, for those with some correlation (be it positive or negative), such data can help in making informed decisions on conservation efforts (whether its required/if those already in place are effective or not).
lm_plot(var='Continent', factor='Emissions', sharex=False)
Negative correlation
No correlation/Indeterminate
Once again, we see that the negative correlations are stronger than the positive ones. This could be that on the whole, biodiversity is decreasing as time passes and emissions increase.
lm_plot(var='Ocean', factor='Year')
Negative correlation
Positive correlation
No correlation/Indeterminate
Note that most of the data points in the North Sea are either 0 or 1, other than one data point in the middle. This is because there is very little variation of the values (i.e. three unique values), hence the normalisation results in only three unique values, which mostly fall exactly at 0 or 1, other than the outlier at 0.2.
Next, we investigate biodiversity across various Habitats and Biotopes. Note that Habitats form a subset of Biotope.
lm_plot(var='HabitatName', factor='Year', threshold=0.2, multiplier=-1)
Firstly, the r value is used as a rough gauge to measure the trend. Note that this is only an approximation since the trend may not be linear. However, it is used because it would be difficult to estimate what the actually trend is, thought sometimes it can be seen from the data points (e.g. 'Woodland (coniferous)' in the first row seems like it follows an exponential decay graph.
Also, even though all the graphs displayed here have a moderate negative correlation (due to the relatively high r value), some correlations might not be as strong as the appear to be. For example, for 'Moist to wet mixed heathland', the data points do not seem to follow a clear decreasing trend.
lm_plot(var='HabitatName', factor='Year', threshold=0.2, multiplier=1)
There are also many other habitat types that follow a moderate positive linear trend. However, note that that number of positive correlations is less than the number of negative ones (19 vs 22), which could once again point to an overall decreasing trend of biodiversity, across various habitats.
Let us now visualise where these habitats are situated across the globe.
def plot_world_r(var):
r_data=data.join(get_r(var, 'Year'), on=var)
r_data=r_data.groupby([var, 'LongDD', 'LatDD', 'sign']).mean()[['r', 'abs_r']].reset_index()
r_data=r_data[r_data['abs_r'] > 0.2]
m=folium.Map(location=[20,0], tiles='OpenStreetMap', zoom_start=2)
for i in range(0,len(r_data)):
folium.CircleMarker(
location=[r_data.iloc[i]['LatDD'], r_data.iloc[i]['LongDD']],
popup='{}: {}<br><br>r: {}'.format(var, r_data.iloc[i][var], round(r_data.iloc[i]['r'], 3)),
radius=float(r_data.iloc[i]['abs_r'])*30,
color='green' if r_data.iloc[i]['sign'] == '+' else 'red',
fill=True,
fill_color='green' if r_data.iloc[i]['sign'] == '+' else 'red'
).add_to(m)
item_text='<br> {item} <i class="fa fa-circle fa-2x" style="color:{col}"></i>'
html_items=item_text.format(item='Negative r:', col= 'red')+item_text.format(item='Positive r:', col= 'green')
legend_html = '''
<div style="
position: fixed;
bottom: 50px; left: 50px; width: 150px; height: 100px;
border:2px solid grey; z-index:9999;
background-color:white;
opacity: .85;
font-size:14px;
font-weight: bold;
">
{title}
{item_text}
</div> '''.format(title='Legend', item_text=html_items)
m.get_root().html.add_child(folium.Element(legend_html))
title_html = '''
<h3 align="center" style="font-size:20px"><b>{}</b></h3>
'''.format(
'Graph of r values (correlation between Population and Year) over {} across the globe'.format(var)
)
m.get_root().html.add_child(folium.Element(title_html))
return m
plot_world_r('HabitatName')
We identify a few trends.
In North America, all of the correlations are negative, other than two positive ones.
In the UK, many study on many different habitats have been performed. Surprisingly, the majority of correlations are positive, i.e. biodiversity has been increasing over time. This could mean that conservation efforts in the UK have been effective on the whole.
However, in the rest of Europe (notably Finland), the correlations are mostly negative. It is possible that UK puts in the most effort into biodiversity conservation. Interestingly, according to an article ranking European nations most committed to wildlife conservation based on their Google search volumes for certain key phrases, the UK tops almost all the rankings.
lm_plot(var='BiotopeType', factor='Year', threshold=0.2)
Negative correlation
Positive correlation
It seems like the trend can be negative or positive, depending on the Biotope, and there is no clear distinction between Biotopes with negative correlations and those with positive correlations.
Similar to Habitat, we can plot the r values between Population and Year (for different Biotopes) on the world map.
plot_world_r('BiotopeType')
Unfortunately, it appears that most of the correlations are weak, leading to very few data points being displayed on the map. Oddly though, only one of these several correlations displayed is positive. Thus, by grouping time series by the Biotope, most of the correlations becomes negligible, and those that are moderately strong seem to point towards a decrease in biodiversity with time.
px.pie(data.dropna()['BiotopeType'].value_counts().to_frame().reset_index(),
values='BiotopeType',
names='index',
).show()
The top four Biotopes (excluding 'Unspecified habitat or no information') are interestingly, all densely vegetated areas (forests and grassland), though this could just be a coincidence. We continue our investigation into these four Biotopes.
multiscatter_plot(hue='TaxonomicPhylum', col='BiotopeType', lm=True, data=data[data['BiotopeType'].isin(
data.dropna()['BiotopeType'].value_counts().to_frame().reset_index().iloc[[0,1,3,4]]['index']
)])
From this, we see that even in the same biotope, different phyla are affected differently.
Deciduous forest: Angiospermophyta population drops, while that for Arthropoda seems to increase. There is no clear correlation for Chordata.
Grassland: The population of Mollusca drops abruptly, while that of Chordata decreases to a minimum around 1970, before increasing again. As for Angiospermophyta, there does not seem to be a clear correlation. Arthropoda seems to follow a strong positive correlation.
Coniferous forest: Arthropoda populations seem to follow a weak positive correlation, while Chordata follows a weak negative correlation.
Mixed or unspecified forest: The population for Chordata increases. The population for Mollusca follows a weak negative correlation. There is no clear correlation for Arthropoda.
This shows how it may be an overgeneralisation to draw conclusions on the biodiversity populations without considering different classifications of organisms. Of course, we could continue making the classifications even more specific (e.g. even down to the exact species investigated), but as seen in earlier portions, this would make the trends seen overly specific and hence ungeneralisable, which is not what we aim to achieve.
Using this information, we can also see what groups of organisms conservation efforts should target. For example, Chordata populations do not seem to be decreasing in these four biotopes, but Mollusca populations decrease in the two biotopes where samples of them were taken.
lm_plot(var='TaxonomicPhylum', factor='Year')
Negative correlation
Positive correlation
No correlation/Indeterminate
This correlations are in line with the correlations of the populations of various phyla against CO2 levels and emissions, which makes sense because they have also been rising over time.
lm_plot(var='TaxonomicClass', factor='Year', threshold=0.2, multiplier=-1)
Negative correlation
It is interesting to note that those with strong correlations are simpler life forms, compared to those with moderate/weak correlations. Such organisms could be more susceptible to environmental changes.
Bacillariophyceae: Diatoms are a key group of photosynthetic protists which have a unique glass-like wall made of SiO2 (silicon dioxide) embedded in an organic matrix. They play an important role in carbon sequestration (i.e. which helps reduce CO2 levels). Hence, their decline (which may not be due to increased CO2 levels, since they themselves are photosynthetic, which requires CO2), is worrying, and if other research does indeed show that diatom populations are decreasing on a large scale, action should be taken. This includes fertilising the ocean with essential nutrients, such as iron, to encourage diatom blooms. However, it must be ensured that there is minimal disruption to other parts of the marine ecosystem.
Bivalvia: As CO2 levels increase, more CO2 dissolves in the ocean, producing carbonic acid (H2CO3). This decreases the pH of the ocean, resulting in a phenomenon known as ocean acidification. This negatively impacts the buildup of the calcium carbonate (more specifically calcite and aragonite) shells of bivalves, resulting in impeded larval shell development and greater susceptibility to predators.
Gastropoda: Due to human activities like deforestation which destroy habitats, the populations of gastropods could have been adversely affected.
Angiospermopsida (Dicotyledoneae)/Angiospermopsida (Monocotyledoneae): These form up the bulk of flowering plants, and are very diverse in terms of genotype and phenotype. Thus, different species are affected to different extents, but the overall trend is still a moderately declining one.
Mammalia: This is yet another vary diverse class of animals, including bats, dugongs, quokkas, and humans (note that Homo sapiens were not included in this biodiversity study). Due to many conflicting factors, the trends for various subtaxa could have cancelled one another out after aggregation.
*'Unknown' is not included here, though it appears to have a very strong correlation. This is because 'Unknown' agglomerates all the data points whose TaxonomicClass is ambiguous/undetermined.
lm_plot(var='TaxonomicClass', factor='Year', threshold=0.2, multiplier=1)
Positive correlation
Firstly, we see that the number of TaxonomicClasses that have a non-negligible positive correlation is much lower than that for negative correlations (2 positive, compared to 7 negative). We also realise that the positive correlations, on average, are weaker than the negative ones. This could mean that overall, biodiversity is on the decline. Of course, the r value of a linear regression line is merely a very approximate estimation of correlation, as it is clearly seen that the trend for Aves had a dip around 1975-1985, before increasing again; the trend for Insecta was decreasing from 1950-1965, before increasing till 1989.
Aves: The impact of birds can be very different. For non-migratory/flightless birds, they would be restricted to a smaller region and their populations could be more affected by changes in the environment. However, for migratory birds, which can even traverse across continents, their behaviour could be the one that is more majorly affected. For example, if the sample was performed in an area where migratory birds fly to during a certain season, yet environmental pressures were such that those birds would increasingly flock to that area, a perceived increase in population would be erroneously registered.
Insecta: Insects can be negatively affected by CO2 levels, as excessive CO2 partial pressure can induce hypoxia, which is the lack of oxygen. And yet, the trend seen here is an increasing one. This is anomalous. However, insects are also highly sensitive to other biotic (e.g. presence of predators, vegetation) and abiotic (e.g. air quality, pesticides) factors. However, pollution has also been on the rise, especially in major cities around the world, so this trend is weird.
More analysis will done on these two classes.
heatmap(data=data[data['TaxonomicClass'] == 'Aves'],
index='TaxonomicOrder',
category='Aves'
)
However, it seems like there still isn't much of a correlation for most of the orders displayed above. Only perhaps Strigiformes shows an increasing trend, while Pelecaniformes shows an increasing trend till 1967, before dropping drastically in 1968, and then increasing back up again.
heatmap(data=data[data['TaxonomicClass'] == 'Insecta'],
index='TaxonomicOrder',
category='Insecta'
)
The above plot does not have any clear trend, so we will delve deeper into the specific taxonomic families within the orders.
heatmap(data=data[data['TaxonomicClass'] == 'Insecta'],
index='TaxonomicFamily',
category='Insecta'
)
However, for insects we do see a clear trend that seems to elucidate more about the earlier weakly positively correlated regression plot. For example, there are some families which show an obvious decline (e.g. Arctiidae, Carabidae, Libellulidae, Tortricidae). However, others show an increase (e.g. Coenagriidae, Lycaenidae, Nymphalidae). Some others still show no obvious correlation.
Nonetheless, this can provide valuable data regarding the populations of specific families of insects, and can give insights about conservation efforts, for example, if it needs to be implemented, or if current efforts are effective.
multiscatter_plot(data=data[data['Continent'] != 'Not applicable'], hue='TaxonomicPhylum', col='Continent', lm=True)
North America: Angiospermophyta and Mollusca populations follow strong negative correlations with time. Chordata seems to follow a weak positive correlation, while the correlation is negligible for Arthropoda.
Europe: Chromophyta (Heterokontophyta), Mollusca, and Monera populations follow strong negative correlations with time. Angiospermophyta follows a weaker negative correlation. Meanwhile, Arthropoda and Chordata populations both follow weak positive correlations.
South America: Angiospermophyta and Chordata follow negative correlations. That for Angiospermophyta is weak, while Chordata's is strong.
The rest: The four other continents only have Chordata data. Chordata in Antarctica and Africa follow weak positive correlations, while in Australia, there is a strong negative correlation. However, in Asia, the correlation is negligible.
Overall, other than Arthropoda (whose population is either non-increasing or weakly increasing) and Chordata (whose trend varies vastly across continents), the phyla have a downward trend.
heatmap(data=data[(data['Continent'] == 'North America') & (data['TaxonomicPhylum'] == 'Arthropoda')],
index='CommonName',
category='Arthropoda in North America'
)
In fact, there are only two species of arthropods investigated here: the Dungeness crab and the eastern spruce budworm (a species of moth). Though it appeared that there was no correlation for Arthropoda in North America, we realise that the eastern spruce budworm population had in fact drastically decreased, while that for the Dungeness crab varied quite periodically. This 'masked' the trend for the eastern spruce budworm.
heatmap(data=data[(data['Continent'] == 'Asia')],
index='CommonName',
category='Asia'
)
We see that there are two time series, which coincindentally segue perfectly into each other, which resulted in a scatterplot with seemingly no correlation. In fact, the greater lizardfish, which lives in Indo-Pacific waters, faced a rather rapid decline in population, while the red-backed vole population did not have any clear trend.
As promised earlier (somewhere in the EDA section), we will investigate the impacts on Aves (birds) populations across different countries. This is because many countries have collected data on birds and it would be interesting to compare their populations.
lm_plot(data=data[data['TaxonomicClass'] == 'Aves'], var='Country', factor='Year', threshold=0.2, multiplier=-1)
Negative correlation
There are many factors that could have caused the decline in bird populations. This includes the replacement of native species' natural habitat with concrete jungles. Since birds rely on trees to roost and to feed, they would be negatively affected. Such effects impact both endangered and least-concern species of birds, but the ramifications on endangered species is certainly more pronounced. Endangered bird populations in Australia declined by 52% from 1985-2015. This is very bad, because even after the timeframe of the data shown above, populations have still continued to decline. Perhaps conservation efforts should be implemented or strengthened in these countries.
lm_plot(data=data[data['TaxonomicClass'] == 'Aves'], var='Country', factor='Year', multiplier=1)
Positive correlation
It seems like the UK on the whole seems to be faring fine here. However, the UK Department for Environment, Food and Rural Affairs found that bird populations, including farmland birds, woodland birds, water and wetland birds, and seabirds had their population decline after 1980. Nonetheless, the fact that the UK has conducted studies on this is reassuring as it would help inform conservation efforts.
*Though Multiple or International jurisdiction seems to have a strong negative correlation, its data is from various parts of the world and hence should not be considered.
The low r value for Canada is due to its obvious U-shaped trend, which has a dip around 1960-1970. Coincidentally, the nature conservation movement was kickstarted around then, with the establishment of The Canadian Wildlife Federation in 1961, the National and Provincial Parks Association of Canada (now the Canadian Parks and Wilderness Society) in 1963, the World Wildlife Fund Canada in 1967 and the Canadian arm of the Sierra Club in 1970. Federal and provincial governments also set up ministries of the environment to enact legislation regarding its conservation. It seems like these efforts have borne fruit, from the rise in bird populations after 1970.
We define some functions to aid in the training and testing of some MLRMs, which integrates the earlier sections investigating the impacts of increased CO2 levels and emissions from various regions on biodiversity together with looking into how these impacts vary across regions and taxa.
def mlr(data, groupby, unique):
# We remove these data because they should not contribute to the training and testing of the MLRM
mod=data.copy()
mod=mod[(mod['HabitatName'] != 'No information supplied in source document')
& (mod['BiotopeType'] != 'Unspecified habitat or no information')
& (mod['Country'] != 'Multiple or International jurisdiction')
& (mod['Continent'] != 'Not applicable')
& (mod['CommonName'] != 'Unknown')
& (mod['TaxonomicClass'] != 'Unknown')
& (mod['TaxonomicOrder'] != 'Unknown')
& (mod['TaxonomicFamily'] != 'Unknown')
& (mod['Emissions'] > 0)
& (mod['IsoAlpha'].notnull())]
mod=mod.groupby(groupby).median().dropna()[['Population']].reset_index()
for c in mod.select_dtypes(include=[object]):
mod=mod.join(pd.get_dummies(mod[c])).drop([c], axis=1)
mlm=LinearRegression()
x_train, x_test, y_train, y_test=train_test_split(mod.drop(['Population'], axis=1),
mod['Population'], test_size=0.2, random_state=0
)
mlm.fit(x_train, y_train)
yhat=mlm.predict(x_test)
return (y_test, yhat,
pd.DataFrame(index=mod.drop(['Population'], axis=1).columns, data=mlm.coef_).T
)
pd.set_option('display.float_format', '{:.2E}'.format)
def plot_mlr(unique, groupby, data=data):
nrow=math.ceil(float(len(data[unique].unique())/4))
ncol=4
fig, axes=plt.subplots(nrow, ncol,figsize=(15, math.ceil(float(len(data[unique].unique())/4))*3))
coefs=None
count=0
for r in range(nrow):
for c in range(0, ncol, 2):
current=data[unique].unique()[count]
mod=data[data[unique] == current]
try:
y_test, yhat, coef=mlr(mod[mod['Population'].notna()],
groupby=groupby,
unique=unique
)
coefs=coef.rename({0: current}) if coefs is None else pd.concat([coef.rename({0: current}), coefs], sort=False)
plt.suptitle('Categorised by ' + ', '.join(groupby[3:]), size=20)
# Distplots
sns.distplot(y_test, color='r', label='Actual Value', ax=axes[r, c]);
sns.distplot(yhat, color ='b', label='Fitted Value', ax=axes[r, c]);
axes[r, c].legend(['Actual', 'Predicted'])
axes[r, c].set_title('{}={}'.format(unique, current), size=16)
# Predicted vs Actual
sns.regplot(x=y_test, y=yhat, ax=axes[r, c+1]).plot([0, 1], [0,1], color='red', alpha=0.3)
axes[r, c+1].set_xlim(0, 1)
axes[r, c+1].set_ylim(0, 1)
axes[r, c+1].set_xlabel('Actual')
axes[r, c+1].set_ylabel('Predicted')
count+=1
# When there are too few data points to split the dataset into training and testing data, an error is thrown
except (NameError, ValueError) as e:
print(e)
continue
fig.tight_layout()
plt.show()
'''
This tells us about how different parameters affect Population. For example, if the coefficient is negative, this means that
it negatively correlates with Population, and vice versa. It also allows us to compare the relative extent to which various
variables a
'''
return coefs
plot_mlr(unique='TaxonomicClass', groupby=['Year', 'CO2', 'Emissions', 'Continent'])
| Year | CO2 | Emissions | Europe | South America | North America | Africa | Asia | Australia | |
|---|---|---|---|---|---|---|---|---|---|
| Angiospermopsida (Monocotyledonae) | -3.99E-01 | 2.47E-01 | 4.77E-10 | -1.95E+00 | 1.95E+00 | NAN | NAN | NAN | NAN |
| Angiospermopsida (Dicotyledoneae) | 1.74E-01 | -1.07E-01 | -6.80E-10 | NAN | NAN | 0.00E+00 | NAN | NAN | NAN |
| Crustacea | 6.85E-03 | -1.47E-02 | 9.02E-11 | NAN | NAN | 0.00E+00 | NAN | NAN | NAN |
| Mammalia | 4.17E-02 | -2.07E-02 | -1.58E-10 | 3.43E-01 | NAN | 9.81E-02 | -5.68E-01 | 1.27E-01 | NAN |
| Insecta | -9.29E-02 | 5.53E-02 | 2.21E-10 | -5.19E-02 | NAN | 5.19E-02 | NAN | NAN | NAN |
| Aves | -1.09E-02 | -1.82E-03 | 1.31E-10 | -4.55E-01 | 3.12E-01 | -2.24E-01 | NAN | NAN | 3.67E-01 |
From the graphs, we see that most of the MLRMs are suitable, other than perhaps Aves and Crustacea.
Next, we direct our attention to the coefficient table*. Let us analyse it.
Unfortunately, due to the lack of data, there is not much that can be deduced about the coefficients regarding the continents. However, an interesting anomaly is that the signs of the coefficients for Angiospermopsida (Monocotyledonae) and Angiospermopsida (Dicotyledonae) for the variables Year, CO2, and Emissions are all opposite, even though they are both part of the phylum Angiospermophyta. This could just be due to the unrepresentativity of the data (i.e. small sample size across few regions), as such vast differences would probably not be chalked up to fundamental differences in their genotype.
*The orders of magnitudes of the coefficients across columns is meangingless. This is because the Year is in the thousands, CO2 is in the hundreds, Emissions is in the billions and the other columns are binary (i.e. 0/1) after one-hot encoding. Hence, this would directly affect the order of magnitude of the coefficient. Thus, we will only compare the coefficients across rows.
Overall, biodiversity, whether sorted by taxon, biome, or region, has generally been on the decline. Populations also generally have negative correlations with CO2 levels and emissions. Of course, some exceptions have been seen, notably Chordata and Arthropoda. However, as noted earlier as well, this could just be due to the broadness of the phylum, and conflicting factors result in mixed results varying across subtaxa, that upon aggregation, are indistinguishable.
Several assumptions were made in this report.
One is that in real life, populations do not usually vary linearly, but many linear regression lines were plotted, and the r value was used to determine the sign and the strength of the correlation. However, this author decided against utilising a polynomial or exponential fit because it is (1) computationally expensive which would eat up lots of time which he lacks and (2) possible that overfitting can result. Hence, linear regression is a naïve but not totally unworkable method to estimate correlation strength.
It was also assume that the samples taken are representative of biodiversity there. This is, of course, not necessarily universally true, especially so for specific groups which have very little data (e.g. ocean data, data of rare species). However, the Global Population Dynamics Database is already a very comprehensive repository, and this author just decided to go with what he had.
Notwithstanding the above assumptions, some recommendations can still be put forward, based on the analyses. For example, several reasons as to how CO2 emissions could directly cause biodiversity loss have been proposed. Even if it is not a causation relationship, there could also be other correlations. For instance, wildfires release sequestered carbon in the form of CO2, whilst also destroying habitat, razing vegetation, and displacing fauna.
Moreover, the impacts on biodiversity depending on region were investigated here. In particular, the correlations of bird populations were investigated as many countries had data on birds. Possible reasons explaining whether certain measures were working (or not) were given. Thus, for countries with decreasing populations of a certain taxon should consider increasing conservation efforts. Whereas others, like Canada, which have seen how a surge in environmental action have positively impacted bird populations, should continue to maintain and even reinforce such efforts. We might need to take stronger action to protect our plant population, as the Angiospermophyta populations were on the decline for the three continents which had data on it, namely North America, South America, and Europe. These recommendations should be extended to all countries globally. This is also because plants are the primary producers from which ecosystems branch. By safeguarding the fundamental roots of the food web, other organisms would also be protected.
Anyway, on the whole, this report has just reiterated (yet again) the importance of biodiversity conservation. Biodiversity is suffering because of humans. If we cast it aside, we will just stay on the track towards ecological collapse and mass extinction.
This report is also not based on the most updated data (all samples were from the middle to late 20th century), and yet we see such terrifying trends. Some further work could be done on more recent data, if made available, to see how biodiversity is faring in the modern day.
Bibliography
Hummel, M., Environmental Movement in Canada (2020). In The Canadian Encyclopedia. Retrieved from https://www.thecanadianencyclopedia.ca/en/article/environmental-and-conservation-movements
Khaliq, A., Javed, M., Sohail, M., & Sagheer, M. (2014). Environmental effects on insects and their population dynamics. Journal of entomology and zoology studies, 2, 1-7.
Lewis, S. L., & Maslin, M. A. (2015). Defining the anthropocene. Nature, 519(7542), 171–180. https://doi.org/10.1038/nature14258
Simmonds, J., Salazar, A., Watson, J., & Maron, M. (2019, October 29). Australia’s beloved native birds are disappearing – and the cause is clear. The Guardian. Retrieved October 2, 2021, from https://www.theguardian.com/environment/2019/oct/29/australias-beloved-native-birds-are-disappearing-and-the-cause-is-clear.
UK Department for Environment, Food and Rural Affairs. (2020). Wild Bird Populations in the UK, 1970 to 2019. Biodiversity Statistics Team, UK Department for Environment, Food and Rural Affairs.
Urry, L. A., Cain, M. L. 1., Wasserman, S. A., Minorsky, P. V., Reece, J. B., & Campbell, N. A. (2017). Essential biology. Eleventh edition. New York, NY: Pearson Education, Inc.
Waldbusser, G., Hales, B., Langdon, C. et al. Saturation-state sensitivity of marine bivalve larvae to ocean acidification. Nature Clim Change 5, 273–280 (2015). https://doi.org/10.1038/nclimate2479
Datasets
https://knb.ecoinformatics.org/view/doi:10.5063/F1BZ63Z8 (Global Population Dynamics Database)
https://ourworldindata.org/atmospheric-concentrations (Atmospheric CO2 levels)
https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions (CO2 emissions)
Here lie a few cool looking visualisations that did not make it to the final report.
fig=px.scatter_3d(
format_mod(data=data, var='TaxonomicPhylum', factor='Year', threshold=0, multiplier=0).sort_values('TaxonomicPhylum'),
x='Year', y='Population', z='Emissions', color='TaxonomicPhylum'
).show()
px.scatter_geo(data.dropna(),
lat='LatDD',
lon='LongDD',
animation_frame='Year',
title='Populations across countries',
size='Population',
color='Country',
projection='equirectangular'
).show()
px.scatter_geo(data.dropna(),
lat='LatDD',
lon='LongDD',
animation_frame='Year',
title='Populations across biotopes',
size='Population',
color='BiotopeType',
projection='equirectangular'
).show()